{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Dataframes "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "* Dataframes are a restricted sub-type of RDDs. \n",
    "* Restircing the type allows for more optimization."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "* Dataframes store two dimensional data, similar to the type of data stored in a spreadsheet. \n",
    "   * Each column in a dataframe can have a different type.\n",
    "   * Each row contains a `record`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "* Similar to, but not the same as, [pandas dataframes](http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dataframe) and [R dataframes](http://www.r-tutor.com/r-introduction/data-frame)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'2.3.0'"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from pyspark import SparkContext\n",
    "sc = SparkContext(master=\"local[4]\")\n",
    "sc.version"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Populating the interactive namespace from numpy and matplotlib\n"
     ]
    }
   ],
   "source": [
    "import os\n",
    "import sys\n",
    "\n",
    "from pyspark import SparkContext\n",
    "from pyspark.sql import SQLContext\n",
    "from pyspark.sql.types import Row, StructField, StructType, StringType, IntegerType\n",
    "%pylab inline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<pyspark.sql.context.SQLContext at 0x7f6a30c77e10>"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Just like using Spark requires having a SparkContext, using SQL requires an SQLContext\n",
    "sqlContext = SQLContext(sc)\n",
    "sqlContext"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Constructing a DataFrame from an RDD of Rows\n",
    "Each Row defines it's own  fields, the schema is *inferred*."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Row(age=19, name='John'),\n",
       " Row(age=23, name='Smith'),\n",
       " Row(age=18, name='Sarah')]"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# One way to create a DataFrame is to first define an RDD from a list of Rows \n",
    "some_rdd = sc.parallelize([Row(name=u\"John\", age=19),\n",
    "                           Row(name=u\"Smith\", age=23),\n",
    "                           Row(name=u\"Sarah\", age=18)])\n",
    "some_rdd.collect()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "root\n",
      " |-- age: long (nullable = true)\n",
      " |-- name: string (nullable = true)\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# The DataFrame is created from the RDD or Rows\n",
    "# Infer schema from the first row, create a DataFrame and print the schema\n",
    "some_df = sqlContext.createDataFrame(some_rdd)\n",
    "some_df.printSchema()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pyspark.rdd.RDD'> <class 'pyspark.sql.dataframe.DataFrame'>\n",
      "some_df = [Row(age=19, name='John'), Row(age=23, name='Smith'), Row(age=18, name='Sarah')]\n",
      "some_rdd= [Row(age=19, name='John'), Row(age=23, name='Smith'), Row(age=18, name='Sarah')]\n"
     ]
    }
   ],
   "source": [
    "# A dataframe is an RDD of rows plus information on the schema.\n",
    "# performing **collect()* on either the RDD or the DataFrame gives the same result.\n",
    "print(type(some_rdd),type(some_df))\n",
    "print('some_df =',some_df.collect())\n",
    "print('some_rdd=',some_rdd.collect())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Defining the Schema explicitly\n",
    "The advantage of creating a DataFrame using a pre-defined schema allows the content of the RDD to be simple tuples, rather than rows."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "root\n",
      " |-- person_name: string (nullable = false)\n",
      " |-- person_age: integer (nullable = false)\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# In this case we create the dataframe from an RDD of tuples (rather than Rows) and provide the schema explicitly\n",
    "another_rdd = sc.parallelize([(\"John\", 19), (\"Smith\", 23), (\"Sarah\", 18)])\n",
    "# Schema with two fields - person_name and person_age\n",
    "schema = StructType([StructField(\"person_name\", StringType(), False),\n",
    "                     StructField(\"person_age\", IntegerType(), False)])\n",
    "\n",
    "# Create a DataFrame by applying the schema to the RDD and print the schema\n",
    "another_df = sqlContext.createDataFrame(another_rdd, schema)\n",
    "another_df.printSchema()\n",
    "# root\n",
    "#  |-- age: binteger (nullable = true)\n",
    "#  |-- name: string (nullable = true)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Loading DataFrames from disk\n",
    "There are many maethods to load DataFrames from Disk. Here we will discuss three of these methods\n",
    "1. Parquet\n",
    "2. JSON (on your own)\n",
    "3. CSV  (on your own)\n",
    "\n",
    "In addition, there are API's for connecting Spark to an external database. We will not discuss this type of connection in this class."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "### Loading dataframes from JSON files\n",
    "[JSON](http://www.json.org/) is a very popular readable file format for storing structured data.\n",
    "Among it's many uses are **twitter**, `javascript` communication packets, and many others. In fact this notebook file (with the extension `.ipynb` is in json format. JSON can also be used to store tabular data and can be easily loaded into a dataframe."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "--2018-04-09 00:48:07--  https://mas-dse-open.s3.amazonaws.com/Moby-Dick.txt\n",
      "Resolving mas-dse-open.s3.amazonaws.com (mas-dse-open.s3.amazonaws.com)... 54.231.176.210\n",
      "Connecting to mas-dse-open.s3.amazonaws.com (mas-dse-open.s3.amazonaws.com)|54.231.176.210|:443... connected.\n",
      "HTTP request sent, awaiting response... 200 OK\n",
      "Length: 1257260 (1.2M) [text/plain]\n",
      "Saving to: ‘../../Data/Moby-Dick.txt.1’\n",
      "\n",
      "Moby-Dick.txt.1     100%[===================>]   1.20M  1.15MB/s    in 1.0s    \n",
      "\n",
      "2018-04-09 00:48:08 (1.15 MB/s) - ‘../../Data/Moby-Dick.txt.1’ saved [1257260/1257260]\n",
      "\n"
     ]
    }
   ],
   "source": [
    "!wget 'https://mas-dse-open.s3.amazonaws.com/Moby-Dick.txt' -P ../../Data/"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{\"name\":\"Michael\"}\n",
      "{\"name\":\"Andy\", \"age\":30}\n",
      "{\"name\":\"Justin\", \"age\":19}\n",
      "people is a <class 'pyspark.sql.dataframe.DataFrame'>\n",
      "+----+-------+\n",
      "| age|   name|\n",
      "+----+-------+\n",
      "|null|Michael|\n",
      "|  30|   Andy|\n",
      "|  19| Justin|\n",
      "+----+-------+\n",
      "\n",
      "root\n",
      " |-- age: long (nullable = true)\n",
      " |-- name: string (nullable = true)\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# when loading json files you can specify either a single file or a directory containing many json files.\n",
    "path = \"../../Data/people.json\"\n",
    "!cat $path \n",
    "\n",
    "# Create a DataFrame from the file(s) pointed to by path\n",
    "people = sqlContext.read.json(path)\n",
    "print('people is a',type(people))\n",
    "# The inferred schema can be visualized using the printSchema() method.\n",
    "people.show()\n",
    "\n",
    "people.printSchema()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "### Excercise: Loading csv files into dataframes\n",
    "\n",
    "Spark 2.0 includes a facility for reading csv files. In this excercise you are to create similar functionality using your own code.\n",
    "\n",
    "You are to write a class called `csv_reader` which has the following methods:\n",
    "\n",
    "* `__init__(self,filepath):` recieves as input the path to a csv file. It throws an exeption `NoSuchFile` if the file does not exist.\n",
    "* `Infer_Schema()` opens the file, reads the first 10 lines (or less if the file is shorter), and infers the schema. The first line of the csv file defines the column names. The following lines should have the same number of columns and all of the elements of the column should be of the same type. The only types allowd are `int`,`float`,`string`. The method infers the types of the columns, checks that they are consistent, and defines a dataframe schema of the form:\n",
    "```python\n",
    "schema = StructType([StructField(\"person_name\", StringType(), False),\n",
    "                     StructField(\"person_age\", IntegerType(), False)])\n",
    "```\n",
    "If everything checks out, the method defines a `self.` variable that stores the schema and returns the schema as it's output. If an error is found an exception `BadCsvFormat` is raised.\n",
    "* `read_DataFrame()`: reads the file, parses it and creates a dataframe using the inferred schema. If one of the lines beyond the first 10 (i.e. a line that was not read by `InferSchema`) is not parsed correctly, the line is not added to the Dataframe. Instead, it is added to an RDD called `bad_lines`.\n",
    "The methods returns the dateFrame and the `bad_lines` RDD."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Parquet files"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "* [Parquet](http://parquet.apache.org/) is a popular columnar format."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "* Spark SQL allows [SQL](https://en.wikipedia.org/wiki/SQL) queries to retrieve a subset of the rows without reading the whole file."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "* Compatible with HDFS : allows parallel retrieval on a cluster."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "* Parquet compresses the data in each column."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "### Spark and Hive\n",
    "* Parquet is a **file format** not an independent database server.\n",
    "* Spark can work with the [Hive](https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started) relational database system that supports the full array of database operations.\n",
    "* Hive is compatible with HDFS."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Moby-Dick.txt\t namesAndFavColors.parquet  users.parquet    US_stations.tsv.gz\r\n",
      "Moby-Dick.txt.1  people.json\t\t    US_stations.tsv  Weather\r\n"
     ]
    }
   ],
   "source": [
    "dir='../../Data'\n",
    "parquet_file=dir+\"/users.parquet\"\n",
    "!ls $dir"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "../../Data/users.parquet\n",
      "+------+--------------+----------------+\n",
      "|  name|favorite_color|favorite_numbers|\n",
      "+------+--------------+----------------+\n",
      "|Alyssa|          null|  [3, 9, 15, 20]|\n",
      "|   Ben|           red|              []|\n",
      "+------+--------------+----------------+\n",
      "\n"
     ]
    }
   ],
   "source": [
    "#load a Parquet file\n",
    "print(parquet_file)\n",
    "df = sqlContext.read.load(parquet_file)\n",
    "df.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "scrolled": true,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+------+--------------+\n",
      "|  name|favorite_color|\n",
      "+------+--------------+\n",
      "|Alyssa|          null|\n",
      "|   Ben|           red|\n",
      "+------+--------------+\n",
      "\n"
     ]
    }
   ],
   "source": [
    "df2=df.select(\"name\", \"favorite_color\")\n",
    "df2.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "drwxrwxrwx 1 jovyan staff 4096 Apr  9  2018 ../../Data/namesAndFavColors.parquet\r\n"
     ]
    }
   ],
   "source": [
    "outfilename=\"namesAndFavColors.parquet\"\n",
    "!rm -rf $dir/$outfilename\n",
    "df2.write.save(dir+\"/\"+outfilename)\n",
    "!ls -ld $dir/$outfilename"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "A new interface object has been added in **Spark 2.0** called **SparkSession**. A spark session is initialized using a `builder`. For example\n",
    "```python\n",
    "spark = SparkSession.builder \\\n",
    "         .master(\"local\") \\\n",
    "         .appName(\"Word Count\") \\\n",
    "         .config(\"spark.some.config.option\", \"some-value\") \\\n",
    "         .getOrCreate()\n",
    "```\n",
    "\n",
    "Using a SparkSession a Parquet file is read [as follows:](http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.parquet):\n",
    "```python\n",
    "df = spark.read.parquet('python/test_support/sql/parquet_partitioned')\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Lets have a look at a real-world dataframe\n",
    "\n",
    "This dataframe is a small part from a large dataframe (15GB) which stores meteorological data from stations around the world. We will read the dataframe from a zipped parquet file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "directory /home/jovyan/work/Sections/Data/Weather already exists\n",
      "removing /home/jovyan/work/Sections/Data/Weather/NY.parquet\n",
      "removing /home/jovyan/work/Sections/Data/Weather/NY.tgz\n"
     ]
    }
   ],
   "source": [
    "from os.path import split,join,exists\n",
    "from os import mkdir,getcwd,remove\n",
    "from glob import glob\n",
    "\n",
    "# create directory if needed\n",
    "\n",
    "notebook_dir=getcwd()\n",
    "data_dir=join(split(split(notebook_dir)[0])[0],'Data')\n",
    "weather_dir=join(data_dir,'Weather')\n",
    "\n",
    "if exists(weather_dir):\n",
    "    print('directory',weather_dir,'already exists')\n",
    "else:\n",
    "    print('making',weather_dir)\n",
    "    mkdir(weather_dir)\n",
    "\n",
    "file_index='NY'\n",
    "zip_file='%s.tgz'%(file_index) #the .csv extension is a mistake, this is a pickle file, not a csv file.\n",
    "old_files='%s/%s*'%(weather_dir,zip_file[:-3])\n",
    "for f in glob(old_files):\n",
    "    print('removing',f)\n",
    "    !rm -rf {f}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "wget https://mas-dse-open.s3.amazonaws.com/Weather/by_state/NY.tgz -P /home/jovyan/work/Sections/Data/Weather \n",
      "--2018-04-09 00:52:22--  https://mas-dse-open.s3.amazonaws.com/Weather/by_state/NY.tgz\n",
      "Resolving mas-dse-open.s3.amazonaws.com (mas-dse-open.s3.amazonaws.com)... 52.218.144.50\n",
      "Connecting to mas-dse-open.s3.amazonaws.com (mas-dse-open.s3.amazonaws.com)|52.218.144.50|:443... connected.\n",
      "HTTP request sent, awaiting response... 200 OK\n",
      "Length: 23182008 (22M) [application/x-tar]\n",
      "Saving to: ‘/home/jovyan/work/Sections/Data/Weather/NY.tgz’\n",
      "\n",
      "NY.tgz              100%[===================>]  22.11M  1.47MB/s    in 15s     \n",
      "\n",
      "2018-04-09 00:52:37 (1.51 MB/s) - ‘/home/jovyan/work/Sections/Data/Weather/NY.tgz’ saved [23182008/23182008]\n",
      "\n",
      "-rwxrwxrwx 1 jovyan staff 23M Mar 16 20:25 /home/jovyan/work/Sections/Data/Weather/NY.tgz\n"
     ]
    }
   ],
   "source": [
    "command=\"wget https://mas-dse-open.s3.amazonaws.com/Weather/by_state/%s -P %s \"%(zip_file, weather_dir)\n",
    "print(command)\n",
    "!$command\n",
    "!ls -lh $weather_dir/$zip_file"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "NY.parquet/\n",
      "NY.parquet/_SUCCESS\n",
      "NY.parquet/part-00000-8342bcf4-7fc2-4183-8e11-aefdb4915fbb-c000.snappy.parquet\n",
      "NY.parquet/part-00001-8342bcf4-7fc2-4183-8e11-aefdb4915fbb-c000.snappy.parquet\n",
      "NY.parquet/part-00002-8342bcf4-7fc2-4183-8e11-aefdb4915fbb-c000.snappy.parquet\n",
      "NY.parquet/part-00003-8342bcf4-7fc2-4183-8e11-aefdb4915fbb-c000.snappy.parquet\n",
      "NY.parquet/part-00004-8342bcf4-7fc2-4183-8e11-aefdb4915fbb-c000.snappy.parquet\n",
      "NY.parquet/part-00005-8342bcf4-7fc2-4183-8e11-aefdb4915fbb-c000.snappy.parquet\n",
      "NY.parquet/part-00006-8342bcf4-7fc2-4183-8e11-aefdb4915fbb-c000.snappy.parquet\n",
      "NY.parquet/part-00007-8342bcf4-7fc2-4183-8e11-aefdb4915fbb-c000.snappy.parquet\n",
      "NY.parquet/part-00008-8342bcf4-7fc2-4183-8e11-aefdb4915fbb-c000.snappy.parquet\n",
      "NY.parquet/part-00009-8342bcf4-7fc2-4183-8e11-aefdb4915fbb-c000.snappy.parquet\n",
      "NY.parquet/part-00010-8342bcf4-7fc2-4183-8e11-aefdb4915fbb-c000.snappy.parquet\n",
      "NY.parquet/part-00011-8342bcf4-7fc2-4183-8e11-aefdb4915fbb-c000.snappy.parquet\n",
      "NY.parquet/part-00012-8342bcf4-7fc2-4183-8e11-aefdb4915fbb-c000.snappy.parquet\n",
      "NY.parquet/part-00013-8342bcf4-7fc2-4183-8e11-aefdb4915fbb-c000.snappy.parquet\n"
     ]
    }
   ],
   "source": [
    "#extracting the parquet file\n",
    "!tar zxvf {weather_dir}/{zip_file} -C {weather_dir}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "/home/jovyan/work/Sections/Data/Weather/NY.parquet\n",
      "+-----------+-----------+----+--------------------+-----+\n",
      "|    Station|Measurement|Year|              Values|State|\n",
      "+-----------+-----------+----+--------------------+-----+\n",
      "|USC00303452|       PRCP|1903|[00 7E 00 7E 00 7...|   NY|\n",
      "+-----------+-----------+----+--------------------+-----+\n",
      "only showing top 1 row\n",
      "\n"
     ]
    }
   ],
   "source": [
    "weather_parquet = join(weather_dir, zip_file[:-3]+'parquet')\n",
    "print(weather_parquet)\n",
    "df = sqlContext.read.load(weather_parquet)\n",
    "df.show(1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "scrolled": true,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+-----------+----+-----------+\n",
      "|    station|year|measurement|\n",
      "+-----------+----+-----------+\n",
      "|USC00303452|1903|       PRCP|\n",
      "|USC00303452|1904|       PRCP|\n",
      "|USC00303452|1905|       PRCP|\n",
      "|USC00303452|1906|       PRCP|\n",
      "|USC00303452|1907|       PRCP|\n",
      "+-----------+----+-----------+\n",
      "only showing top 5 rows\n",
      "\n"
     ]
    }
   ],
   "source": [
    "#selecting a subset of the rows so it fits in slide.\n",
    "df.select('station','year','measurement').show(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "## Summary\n",
    "* Dataframes are an efficient way to store data tables.\n",
    "* All of the values in a column have the same type.\n",
    "* A good way to store a dataframe in disk is to use a Parquet file.\n",
    "* Next: Operations on dataframes."
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "celltoolbar": "Slideshow",
  "hide_input": false,
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.3"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {
    "height": "263px",
    "width": "252px"
   },
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": "block",
   "toc_window_display": false
  },
  "varInspector": {
   "cols": {
    "lenName": 16,
    "lenType": 16,
    "lenVar": 40
   },
   "kernels_config": {
    "python": {
     "delete_cmd_postfix": "",
     "delete_cmd_prefix": "del ",
     "library": "var_list.py",
     "varRefreshCmd": "print(var_dic_list())"
    },
    "r": {
     "delete_cmd_postfix": ") ",
     "delete_cmd_prefix": "rm(",
     "library": "var_list.r",
     "varRefreshCmd": "cat(var_dic_list()) "
    }
   },
   "types_to_exclude": [
    "module",
    "function",
    "builtin_function_or_method",
    "instance",
    "_Feature"
   ],
   "window_display": true
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
